Bellabeat is a high-tech manufacturer of health-focused products for women.
Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market.
Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company.
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat's marketing strategy?
1. Identify potential opportunities for growth.
2. Recommend improvements to the Bellabeat marketing strategy based on trends in smart device usage.
Data set = Fitbit Fitness Tracker Data, made available by Mobius and stored on Kaggle.
Legalities = This dataset is under the CC0: Public Domain license, i.e. the creator has waived their rights to the work under copyright law.
1. Reliability = LOW - data was collected from only 30 participants, with a number of unknowns such as age and gender.
2. Originality = LOW - this data is Fitbit Fitness Tracker Data made available by Mobius stored on Kaggle, originally collected using Amazon Mechanical Turk.
3. Comprehensive = MEDIUM - multiple fields and pieces of information are available, but age and gender are not included, and it is noted that only a total of 2 logs of "Fat" and 67 of Weight exist. There is also no hydration logging, despite Bellabeat's interest in the marketability of their Spring product.
4. Current = LOW - this data set is now 7 years old and there have been significant improvements and changes in habits within that time, especially during and after the pandemic.
5. Cited = HIGH - the source is a reputable data collector and the dataset is well documented.
I will largely focus on the daily use of the device to obtain an overall view of its use but I will also dive into some hourly usage in order to identify specific trends.
I will use R Studio as my primary tool as it has the capabilities to store and manage large datasets and work between them all at once; it also provides visualization tools all within the one environment.
I will now set up my environment, load and clean the data.
Each of my code chunks will run automatically; I have excluded some outputs due to the length of them.
First I will set up my environment by loading the necessary packages for this task.
library(tidyverse)
library(ggplot2)
library(readr)
library(here)
library(dplyr)
library(hms)
library(lubridate)
library(shiny)
library(tidyr)
library(plotly)
library(skimr)
I will check my working directory is set correctly.
getwd()
Now I will load the required CSV files and rename them as appropriate to achieve uniformity.
raw_act_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
raw_cal_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv')
raw_int_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
raw_step_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv')
raw_sleep_day <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
raw_cal_hour <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv')
raw_int_hour <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv')
raw_step_hour <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv')
raw_hr_sec <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv')
raw_weight <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')
raw_cal_min <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteCaloriesNarrow_merged.csv')
raw_int_min <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteIntensitiesNarrow_merged.csv')
raw_met_min <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteMETsNarrow_merged.csv')
raw_sleep_min <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteSleep_merged.csv')
raw_step_min <- read.csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/minuteStepsNarrow_merged.csv')
Now I will check all of the data for correctness and uniformity; I will reformat as appropriate.
# In order to check my data, I used the following code for each as appropriate; I will not execute them here due to the length of the result.
#str(raw_act_day)
#colnames(raw_act_day)
#rownames(raw_act_day)
#head(raw_act_day)
#skim_without_charts(raw_act_day)
#glimpse(raw_act_day)
# note that column 2 is a string instead of a date
act_day <- raw_act_day %>%
mutate(ActivityDate = as.Date(ActivityDate, format = "%m/%d/%Y"))
# have now noted that date columns are named differently in each data frame; I
# will rename for uniformity
act_day <- act_day %>%
rename(Date = ActivityDate)
# now drop na
act_day <- na.omit(act_day)
# repeat process with all other data frames showing similarly
cal_day <- raw_cal_day %>%
mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>%
rename(Date = ActivityDay) %>%
na.omit()
# on this one, I note that "ActivityHour" contains both date and time in
# character format; I will separate this into two columns and reformat
cal_hour <- raw_cal_hour
# then
cal_hour$ActivityHour <-
as.POSIXct(raw_cal_hour$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
# Separate ActivityHour into Date and Time columns
cal_hour <- cal_hour %>%
separate(ActivityHour, into = c("Date", "Time"), sep = " ") %>%
na.omit()
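This parse-then-separate pattern is reused for every hourly and minute-level frame below. A minimal sketch on toy data (not the Fitbit files) shows why it works, plus one caveat:

```r
library(dplyr)
library(tidyr)

# Toy strings in the same "%m/%d/%Y %I:%M:%S %p" format as the Fitbit exports
toy <- data.frame(Id = c(1, 1),
                  ActivityHour = c("4/12/2016 1:00:00 AM",
                                   "4/12/2016 1:00:00 PM"))

toy$ActivityHour <- as.POSIXct(toy$ActivityHour,
                               format = "%m/%d/%Y %I:%M:%S %p")

# separate() coerces POSIXct to character ("2016-04-12 13:00:00"), so
# splitting on the single space yields Date and Time columns.
# Caveat: a midnight timestamp prints with no time portion at all, which
# is one way the "Expected 2 pieces" warning seen later can arise.
toy <- toy %>%
  separate(ActivityHour, into = c("Date", "Time"), sep = " ")

print(toy)
```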
# repeat process with all other data frames showing similarly
cal_min <- raw_cal_min
# then
cal_min$ActivityMinute <-
as.POSIXct(raw_cal_min$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
# Separate ActivityMinute into Date and Time columns
cal_min <- cal_min %>%
separate(ActivityMinute, into = c("Date", "Time"), sep = " ") %>%
na.omit()
hr_sec <- raw_hr_sec
# then
hr_sec$Time <-
as.POSIXct(raw_hr_sec$Time, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
hr_sec <- hr_sec %>%
separate(Time, into = c("Date", "Time"), sep = " ") %>%
na.omit()
int_day <- raw_int_day %>%
mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>%
rename(Date = ActivityDay) %>%
na.omit()
int_hour <- raw_int_hour
# then
int_hour$ActivityHour <-
as.POSIXct(raw_int_hour$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
int_hour <- int_hour %>%
separate(ActivityHour, into = c("Date", "Time"), sep = " ") %>%
na.omit()
int_min <- raw_int_min
# then
int_min$ActivityMinute <-
as.POSIXct(raw_int_min$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
int_min <- int_min %>%
separate(ActivityMinute, into = c("Date", "Time"), sep = " ") %>%
na.omit()
met_min <- raw_met_min
# then
met_min$ActivityMinute <-
as.POSIXct(raw_met_min$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
met_min <- met_min %>%
separate(ActivityMinute, into = c("Date", "Time"), sep = " ") %>%
na.omit()
# on this one, I note that "SleepDay" contains both date and time in
# character format; however, time appears to be irrelevant as every single
# entry is noted as 12:00:00 AM, so I will separate this into two columns,
# reformat, then remove time altogether
sleep_day <- raw_sleep_day
# then
sleep_day$SleepDay <-
as.POSIXct(raw_sleep_day$SleepDay, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
sleep_day <- sleep_day %>%
separate(SleepDay, into = c("Date", "Time"), sep = " ") %>%
select(-Time)
sleep_min <- raw_sleep_min
# then
sleep_min$date <-
as.POSIXct(raw_sleep_min$date, format = "%m/%d/%Y %I:%M:%S %p")
# Separate date into Date and Time columns; parsing introduces a number of
# NAs here, so drop them after separating
sleep_min <- sleep_min %>%
separate(date, into = c("Date", "Time"), sep = " ") %>%
na.omit()
step_day <- raw_step_day %>%
mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%Y")) %>%
rename(Date = ActivityDay) %>%
na.omit()
step_hour <- raw_step_hour
# then
step_hour$ActivityHour <-
as.POSIXct(raw_step_hour$ActivityHour, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
step_hour <- step_hour %>%
separate(ActivityHour, into = c("Date", "Time"), sep = " ") %>%
na.omit()
step_min <- raw_step_min
# then
step_min$ActivityMinute <-
as.POSIXct(raw_step_min$ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
step_min <- step_min %>%
separate(ActivityMinute, into = c("Date", "Time"), sep = " ") %>%
na.omit()
weight <- raw_weight
# then
weight$Date <-
as.POSIXct(raw_weight$Date, format = "%m/%d/%Y %I:%M:%S %p")
# Separate Time into Date and Time columns
weight <- weight %>%
separate(Date, into = c("Date", "Time"), sep = " ")
# there's a lot of NAs, so we won't omit NA on this one
Warning message: “Expected 2 pieces. Missing pieces filled with `NA` in 413 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”
#colnames(cal_day)
This allowed me to choose the most appropriate data frames to place together for analysis:
# List of data frames
list_merge_day <- list(cal_day, int_day,
step_day)
# Use merge() to combine the data frames on Id and Date
merge_day <- Reduce(function(x, y) merge(x, y, by = c('Id', 'Date')),
list_merge_day)
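The Reduce() call folds merge() across the list pairwise. A small sketch with made-up frames (hypothetical values, same Id/Date keys as above):

```r
# Three toy frames keyed on Id and Date, mirroring cal_day / int_day / step_day
df_cal  <- data.frame(Id = c(1, 2), Date = as.Date("2016-04-12"),
                      Calories = c(1800, 2100))
df_int  <- data.frame(Id = c(1, 2), Date = as.Date("2016-04-12"),
                      SedentaryMinutes = c(700, 950))
df_step <- data.frame(Id = c(1, 2), Date = as.Date("2016-04-12"),
                      StepTotal = c(8000, 4500))

# merge(x, y) keeps rows present in both (an inner join); Reduce applies it
# left-to-right: (df_cal joined to df_int), then joined to df_step
merged <- Reduce(function(x, y) merge(x, y, by = c("Id", "Date")),
                 list(df_cal, df_int, df_step))
print(merged)
```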
# [act_day omitted as it essentially covers all of the above already]
# [sleep_day omitted due to a vastly different number of observations, but it is still important to analyse so I will create a separate data frame for it]
merge_act_sleep <- merge(act_day, sleep_day, by=c('Id', 'Date'))
# *******************************************
# List of data frames
list_hour <- list(cal_hour, int_hour, step_hour)
# Use left_join to combine data frames
merge_hour <- reduce(list_hour, left_join, by = c('Id', 'Date', 'Time'))
# *******************************************
# I then figured out a much easier way to do this #d'oh!
merge_cal_weight <- merge(act_day, weight, by=c('Id', 'Date'))
# ******************************************
merge_int_cal <- merge(int_day, cal_day, by=c('Id', 'Date'))
# hours asleep is easier to comprehend than minutes asleep so I will add this
# column using an equation
merge_act_sleep$TotalHoursAsleep <-
merge_act_sleep$TotalMinutesAsleep / 60
# repeat as necessary below
# **********************************
merge_act_sleep$NonActiveMinutes <-
merge_act_sleep$LightlyActiveMinutes + merge_act_sleep$SedentaryMinutes
merge_act_sleep$NonActiveHours <-
(merge_act_sleep$LightlyActiveMinutes + merge_act_sleep$SedentaryMinutes) / 60
merge_act_sleep$ActiveMinutes <-
merge_act_sleep$VeryActiveMinutes + merge_act_sleep$FairlyActiveMinutes
merge_act_sleep$ActiveHours <-
(merge_act_sleep$VeryActiveMinutes + merge_act_sleep$FairlyActiveMinutes) / 60
#**********************************************
merge_day$NonActiveMinutes <-
merge_day$LightlyActiveMinutes + merge_day$SedentaryMinutes
merge_day$NonActiveHours <-
(merge_day$LightlyActiveMinutes + merge_day$SedentaryMinutes) / 60
merge_day$ActiveMinutes <-
merge_day$VeryActiveMinutes + merge_day$FairlyActiveMinutes
merge_day$ActiveHours <-
(merge_day$VeryActiveMinutes + merge_day$FairlyActiveMinutes) / 60
# **********************************************
#filter out any ID that contains less than 6 data points in Weight,
# as this will reduce ability to conduct meaningful analysis
merge_cal_weight_filtered <- merge_cal_weight %>%
group_by(Id) %>%
filter(n() >= 6) %>%
ungroup()
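The group-then-filter step can be sketched on toy data (hypothetical Ids, not from the dataset):

```r
library(dplyr)

# One Id with 7 weight logs, another with only 2
toy_weight <- data.frame(Id = c(rep("A", 7), rep("B", 2)),
                         WeightKg = c(70, 71, 70, 69, 70, 71, 70, 60, 61))

# filter(n() >= 6) evaluates n() per group, so whole Ids are kept or dropped
kept <- toy_weight %>%
  group_by(Id) %>%
  filter(n() >= 6) %>%
  ungroup()

print(unique(kept$Id))
```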
# Convert Id to a factor with modified levels for labeling
merge_cal_weight_filtered$Id <- factor(merge_cal_weight_filtered$Id, levels =
unique(merge_cal_weight_filtered$Id))
# print list of all data frames in environment
# (commented off due to length)
# print(ls())
# List of data frames
list_all_dfs <- list(
act_day = act_day,
cal_day = cal_day,
cal_hour = cal_hour,
cal_min = cal_min,
hr_sec = hr_sec,
int_day = int_day,
int_hour = int_hour,
int_min = int_min,
met_min = met_min,
sleep_day = sleep_day,
sleep_min = sleep_min,
step_day = step_day,
step_hour = step_hour,
step_min = step_min,
weight = weight
)
# Calculate the number of unique IDs in each data frame
count_unique_id <- sapply(list_all_dfs, function(df) n_distinct(df$Id))
# Create a data frame with the results
df_unique_ids <- data.frame(
Data_Frame = names(count_unique_id),
Unique_Count = count_unique_id
)
# Reset row names
rownames(df_unique_ids) <- NULL
# View results
View(df_unique_ids)
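The sapply()/n_distinct() counting step, sketched on two toy frames (hypothetical Ids):

```r
library(dplyr)

# Hypothetical frames: one with three distinct Ids, one with a single Id
toy_dfs <- list(day    = data.frame(Id = c(1, 1, 2, 3)),
                weight = data.frame(Id = c(9, 9, 9)))

# sapply applies the counter to each frame and returns a named vector
counts <- sapply(toy_dfs, function(df) n_distinct(df$Id))
print(counts)
```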
# Convert 'Data_Frame' to a factor
df_unique_ids$Data_Frame <- factor(df_unique_ids$Data_Frame, levels =
df_unique_ids$Data_Frame)
# Create a bar chart
chart_df_unique_ids <- ggplot(data = df_unique_ids, aes(x = Data_Frame,
y = Unique_Count)) +
geom_bar(stat = "identity", fill = "dodgerblue") +
labs(title = "Count of Unique IDs in Each Data Frame",
x = "Data Frame",
y = "Unique Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
chart_df_unique_ids
ggsave('chart_df_unique_ids.png',width=16,height=8)
| Data_Frame | Unique_Count |
|---|---|
| act_day | 33 |
| cal_day | 33 |
| cal_hour | 33 |
| cal_min | 33 |
| hr_sec | 14 |
| int_day | 33 |
| int_hour | 33 |
| int_min | 33 |
| met_min | 33 |
| sleep_day | 24 |
| sleep_min | 24 |
| step_day | 33 |
| step_hour | 33 |
| step_min | 33 |
| weight | 8 |
We can see that 33 IDs use the majority of the functions but the following functions are used less:
1. Sleep by minute and by day (24 IDs)
2. Heartrate per second (14 IDs)
3. Weight logging is the least used (8 IDs)
As only 8 persons have logged their weight, this could indicate that:
1. Users have a negative relationship with weight logging and the feelings it evokes.
2. Users do not have scales.
3. The logging process is difficult in the application.
Therefore I suggest trying to positively inform customers about a healthy relationship with weight logging.
Other fitness applications have introduced smart scales, which can automatically send data to the main application; this would assist the user with easy logging and also potentially open up another marketable range.
I note no mention of the use of any water logging data, which leads me to hypothesize that either:
1. This data is missing.
2. This fitness company does not have an option for this, whereas Bellabeat does with "Spring", highlighting a potential gap in the market that Bellabeat can take advantage of.
3. There is that option but no one is availing of it, thus suggesting that there is no market in the gap.
Either way, more research is needed in regard to the marketability for the logging of hydration throughout the day.
The information on this data set states that the data provided ranges between 12th March and 12th May 2016; let's check.
range(act_day$Date)
range(cal_day$Date)
range(int_day$Date)
range(sleep_day$Date)
range(step_day$Date)
range(weight$Date)
All return 2016-04-12 to 2016-05-12, i.e. 12th April to 12th May 2016, so we have only one month as opposed to two; meaningful data analysis may therefore be constrained further.
Let's summarize some of the data to give us a quick idea of what we're looking at. I used colnames() to establish which column names to use for each summary.
summary_act <- act_day %>%
mutate(SedentaryHours = SedentaryMinutes / 60) %>%
select(TotalSteps,
SedentaryHours,
Calories) %>%
summary()
print(summary_act)
# ************************************************
summary_sleep <- sleep_day %>%
mutate(TotalHoursAsleep = TotalMinutesAsleep / 60) %>%
select(TotalHoursAsleep) %>%
summary()
print(summary_sleep)
# ************************************************
summary_weight <- weight %>%
select(WeightKg,
BMI,
Fat) %>%
summary()
print(summary_weight)
# ************************************************
summary_activity_minutes <- int_day %>%
select(VeryActiveMinutes,
FairlyActiveMinutes,
LightlyActiveMinutes,
SedentaryMinutes) %>%
summary()
print(summary_activity_minutes)
TotalSteps SedentaryHours Calories
Min. : 0 Min. : 0.00 Min. : 0
1st Qu.: 3790 1st Qu.:12.16 1st Qu.:1828
Median : 7406 Median :17.62 Median :2134
Mean : 7638 Mean :16.52 Mean :2304
3rd Qu.:10727 3rd Qu.:20.49 3rd Qu.:2793
Max. :36019 Max. :24.00 Max. :4900
TotalHoursAsleep
Min. : 0.9667
1st Qu.: 6.0167
Median : 7.2167
Mean : 6.9911
3rd Qu.: 8.1667
Max. :13.2667
WeightKg BMI Fat
Min. : 52.60 Min. :21.45 Min. :22.00
1st Qu.: 61.40 1st Qu.:23.96 1st Qu.:22.75
Median : 62.50 Median :24.39 Median :23.50
Mean : 72.04 Mean :25.19 Mean :23.50
3rd Qu.: 85.05 3rd Qu.:25.56 3rd Qu.:24.25
Max. :133.50 Max. :47.54 Max. :25.00
NA's :65
VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
The median amount of steps taken daily is 7,406 with the maximum at 36,019!
The median calories burned per day is 2,134 with the max at 4,900.
The median sedentary hours per day is over 17.5 hours! Given that we sleep for 6-8 hours on average, that leaves roughly 10 further waking hours spent sedentary - could this be while at work in a desk job?
Alert reminder to move during work.
The median hours asleep is just over 7 hours per day with the max at 13 hours, and the min at under an hour!
Add calm reminders for bed - maybe team up with "Calm" or similar.
The median Weight in Kg is 62.5kg, with BMI at 24.39 and Fat at 23.5; healthy BMI is stated as being between 18.5-24.9; however, BMI is not always fool-proof due to differing body compositions, e.g. strongman or bodybuilder athletes.
Healthy fat range for men is 18-24% and for women is 25-31%; while we do not know the genders of our users, the average is within healthy range. However, only 2 entries total were made in "Fat".
Provide more information/education about this to customers and perhaps introduce a gadget into the BB range to assist users in calculating this.
Very Active Minutes, median = 4
Fairly Active Minutes, median = 6
Lightly Active Minutes, median = 199 (3.3 hours)
Sedentary Minutes, median = 1,057 (17.6 hours)
Whilst 17.6 hours seems excessive, the median sleep is just over 7 hours and an assumption of mine is an 8 hour office day, which would take us to over 15 hours.
Add occupation type (admin, labor, etc.) to assist with analysis and accountability for customers. Encourage movement alerts during workday.
I am going to continue my analysis, but will use visualization alongside this now as we go along.
I'm going to analyse the activity throughout the day to try to establish the most commonly active parts of users' days.
# Reformat Date as date and Time as hms
merge_hour <- merge_hour %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
Time = as.POSIXct(Time, format = "%H:%M:%S"))
# Get Mean of Total Intensity
merge_hour_int <- merge_hour %>%
group_by(Time) %>%
summarise(int_mean = mean(TotalIntensity))
# Create the bar chart
chart_int_day <-
ggplot(data = merge_hour_int) +
geom_bar(mapping = aes(x = Time, y = int_mean),
stat = "identity", fill = "#40a8a8") +
labs(title = "Mean Intensity Throughout the Day",
x = "Time of Day",
y = "Mean Intensity") +
scale_x_datetime(breaks = scales::date_breaks("1 hour"),
labels = scales::date_format("%H:%M")) +
theme_minimal() +
theme(
plot.background = element_rect(fill = "white"),
panel.background = element_rect(fill = "white"),
text = element_text(color = "black"),
axis.text.x = element_text(angle = 25, hjust = 1)
)
chart_int_day
ggsave('chart_int_day.png',width=16,height=8)
From this, we can see that people appear to be most active between 4pm and 6pm, with 11am-1pm following closely behind.
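The mean-intensity-per-hour aggregation behind this chart can be sketched on toy data (hypothetical intensity values):

```r
library(dplyr)

toy_hours <- data.frame(Time = rep(c("08:00:00", "17:00:00"), each = 3),
                        TotalIntensity = c(2, 4, 6, 10, 12, 14))

# group_by(Time) pools every user-day sharing that clock hour, so
# int_mean is the average intensity for that hour across the whole period
hourly_mean <- toy_hours %>%
  group_by(Time) %>%
  summarise(int_mean = mean(TotalIntensity))

print(hourly_mean)
```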
I want to see if there is a correlation between the amount of hours spent sleeping and the amount of calories burned.
The pink plots signify the data points of Total Hours Asleep vs Calories Burned, the Green line signifies the linear model and the blue line signifies the locally weighted scatterplot smoothing.
chart_cal_sleep <-
ggplot(data=merge_act_sleep) +
geom_point(mapping=aes(x=Calories,y=TotalHoursAsleep), color='pink') +
geom_smooth(mapping = aes(x = Calories, y = TotalHoursAsleep), color='green',
method = 'lm') +
geom_smooth(mapping = aes(x = Calories, y = TotalHoursAsleep), method =
'loess') +
labs(title='Total Calories Burned vs Total Hours Asleep',
caption='12th April-12th May 2016') +
scale_x_continuous(name = "Calories Burned") +
# adjust the scale as currently only shows "5" and "10"
scale_y_continuous(
name = "Total Hours Asleep",
breaks = seq(0, max(merge_act_sleep$TotalHoursAsleep) + 2, by = 2),
labels = seq(0, max(merge_act_sleep$TotalHoursAsleep) + 2, by = 2))
ggsave('chart_cal_sleep.png',width=16,height=8)
# Convert ggplot to interactive plotly
charti_cal_sleep <- ggplotly(chart_cal_sleep)
# Display the interactive plot
charti_cal_sleep
ggsave('charti_cal_sleep.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
It does not appear there is a correlation between calories burned and total hours asleep; let's check by calculating the correlation coefficient between the two variables to quantitatively measure the strength and direction of the linear relationship.
cor_cal_sleep <- cor(merge_act_sleep$Calories,
merge_act_sleep$TotalHoursAsleep)
cat("Correlation Coefficient:", cor_cal_sleep, "\n")
Correlation Coefficient: -0.02852571
Correlation Coefficient: -0.02852571 - this indicates a correlation so weak that no meaningful conclusion can be drawn.
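To illustrate why a coefficient this close to zero is uninformative, here is a quick synthetic check (simulated data, not the Fitbit sample):

```r
set.seed(42)
x <- rnorm(500)

# Independent noise: cor() lands near 0, as with Calories vs TotalHoursAsleep
unrelated <- rnorm(500)
# A genuine linear signal: cor() lands near 1
related <- 2 * x + rnorm(500)

cat("unrelated:", round(cor(x, unrelated), 3), "\n")
cat("related:  ", round(cor(x, related), 3), "\n")
```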
It does not appear there is a correlation between calories burned and total hours asleep; let's see if there's a correlation in sleep quality against different activity levels.
I have highlighted 6-8 hours as this is advised as the ideal amount of sleep per night; additionally, the median sleep achieved by this group we saw was just over 7 hours.
chart_sleep_sed <-
ggplot(data = merge_act_sleep,
aes(x = NonActiveHours, y = TotalHoursAsleep)) +
geom_point(color = 'pink') +
geom_smooth(method = 'lm', se = FALSE) +
geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = 6, ymax = 8),
fill = "yellow", alpha = 0.006) + # More transparent highlight
labs(title = 'Total Non Active Hours vs Total Hours Asleep',
caption = '12th April - 12th May 2016',
x = "Non Active Hours",
y = "Total Hours Asleep") +
scale_y_continuous(breaks = seq(0, max(merge_act_sleep$TotalHoursAsleep) + 2, by = 2)) +
theme_minimal()
chart_sleep_sed
ggsave('chart_sleep_sed.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
This appears to show negative correlation; let's check:
cor_nonactive_sleep <- cor(merge_act_sleep$NonActiveHours,
merge_act_sleep$TotalHoursAsleep)
cat("Correlation Coefficient:", cor_nonactive_sleep, "\n")
Correlation Coefficient: -0.5825874
Correlation Coefficient: -0.5825874 - this indicates a moderate negative correlation: more Non Active Hours moderately correspond to fewer Total Hours Asleep.
It appears that those within the 6-8 hour sleep band largely fall between 12.5 and 17.5 hours of non/low activity per day, leaving 6.5-11.5 hours of fairly/very active time.
Subtracting the roughly 8 hours spent sleeping, that leaves only 4.5-9.5 hours of non/low activity while awake - keeping in mind desk jobs of 8 hours per day.
Let's see if a less sedentary day affects the amount of sleep obtained.
chart_sleep_active <-
ggplot(data = merge_act_sleep,
aes(x = ActiveHours, y = TotalHoursAsleep)) +
geom_point(color = 'pink') +
geom_smooth(method = 'lm', se = FALSE) +
geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = 6, ymax = 8),
fill = "yellow", alpha = 0.006) + # More transparent highlight
labs(title = 'Total Active Hours vs Total Hours Asleep',
caption = '12th April - 12th May 2016',
x = "Active Hours",
y = "Total Hours Asleep") +
scale_y_continuous(breaks = seq(0, max(merge_act_sleep$TotalHoursAsleep) + 2, by = 2)) +
theme_minimal()
chart_sleep_active
ggsave('chart_sleep_active.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
This appears to show a very weak negative correlation; let's check:
cor_active_sleep <- cor(merge_act_sleep$ActiveHours,
merge_act_sleep$TotalHoursAsleep)
cat("Correlation Coefficient:", cor_active_sleep, "\n")
Correlation Coefficient: -0.1812202
Correlation Coefficient: -0.1812202 - this indicates a weak negative correlation so higher values of one weakly correspond to lower values of the other.
From looking at both of these graphs, the only conclusion I can draw is that the more time spent lightly active/sedentary (Non Active), the fewer hours of total sleep are obtained.
Allow users to set a goal bed time and send alerts, for example 1 hour before, to remind them to prepare for bed. A second alert could be sent 30 minutes beforehand, encouraging them to begin calming activities that promote sleep.
I used ggplot2 and plotly to create an interactive graph showing the calories burned vs Total Steps taken in a day.
# I want a nice color palette for this:
palette_cal_step <- c("#1F78B4", "#33A02C", "#D95F02")
chart_cal_step <- ggplot(data = merge_day) +
geom_point(mapping = aes(x = Calories, y = StepTotal, color = "Data Points"),
size = 3, show.legend = TRUE) +
stat_summary(mapping = aes(x = Calories, y = StepTotal, color = "Median"),
fun = median, geom = "point", size = 1, show.legend = TRUE) +
geom_smooth(mapping = aes(x = Calories, y = StepTotal, color =
"Smoothed Line"), method = 'lm', show.legend = TRUE) +
labs(title = 'Total Calories Burned vs Total Steps',
x = "Calories Burned",
y = "Total Steps",
color = "Legend") +
scale_x_continuous(name = "Calories Burned") +
scale_y_continuous(name = "Total Steps",
breaks = seq(0, max(merge_day$StepTotal) + 2000,
by = 2000),
labels = seq(0, max(merge_day$StepTotal) + 2000,
by = 2000)) +
scale_color_manual(name = "Legend",
values = palette_cal_step,
labels = c("Steps/Cals", "Median", "Linear Model")) +
theme(legend.position = "right")
ggsave('chart_cal_step.png',width=16,height=8)
# Customize legend labels in the interactive plot
charti_cal_step <- plot_ly(data = merge_day) %>%
add_markers(x = ~Calories, y = ~StepTotal, color = I(palette_cal_step[1]),
size = 3, name = "Steps/Cals",
text = ~paste("Calories:", Calories, "<br>Steps:", StepTotal)) %>%
add_markers(x = ~Calories, y = ~StepTotal, color = I(palette_cal_step[2]),
size = 1, name = "Median",
text = ~paste("Calories:", Calories, "<br>Steps:", StepTotal)) %>%
add_trace(x = ~Calories, y = ~predict(lm(StepTotal ~ Calories)),
mode = "lines", color = I(palette_cal_step[3]),
name = "Linear Model",
text = ~paste("Calories:", Calories, "<br>Steps:",
round(predict(lm(StepTotal ~ Calories)), 2))) %>%
layout(title = 'Total Calories Burned vs Total Steps',
xaxis = list(title = "Calories Burned"),
yaxis = list(title = "Total Steps"),
         legend = list(title = list(text = "Legend"),
                       itemsizing = "constant"))
# Display the interactive plot
charti_cal_step
ggsave('charti_cal_step.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
No trace type specified: Based on info supplied, a 'scatter' trace seems appropriate. Read more about this trace type -> https://plotly.com/r/reference/#scatter
This appears to show a positive correlation; let's check:
cor_cal_step <- cor(merge_day$Calories,
merge_day$StepTotal)
cat("Correlation Coefficient:", cor_cal_step, "\n")
Correlation Coefficient: 0.5915681
Correlation Coefficient: 0.5915681 - this indicates a moderate positive correlation so higher values of one moderately correspond to higher values of the other.
This shows a positive correlation between steps taken and calories burned.
I analysed the weight (Kg) fluctuation over a month for two users, noting the steps taken in the first visualization and the calories burnt in the second.
chart_step_weight <-
ggplot(data = merge_cal_weight_filtered) +
geom_point(mapping = aes(x = Date, y = WeightKg, color = TotalSteps,
group = Id)) +
geom_line(mapping = aes(x = Date, y = WeightKg, color = TotalSteps,
group = Id)) +
scale_color_gradient(low = "red", high = "green") +
labs(title = "Weight vs Total Steps by Two Users over 1 Month",
caption = '12th April to 12th May 2016',
color = 'Total Steps Taken') +
facet_wrap(~ Id, scales = "free_y")
chart_step_weight
ggsave('chart_step_weight.png',width=16,height=8)
# **********************************************
chart_cal_weight <-
ggplot(data = merge_cal_weight_filtered) +
geom_point(mapping = aes(x = Date, y = WeightKg, color = Calories,
group = Id)) +
geom_line(mapping = aes(x = Date, y = WeightKg, color = Calories,
group = Id)) +
scale_color_gradient(low = "red", high = "green") +
labs(title = "Weight vs Calories Burnt by Two Users over 1 Month",
caption = '12th April to 12th May 2016',
color = 'Calories Burnt') +
facet_wrap(~ Id, scales = "free_y")
chart_cal_weight
ggsave('chart_cal_weight.png',width=16,height=8)
I note that subject two appears to have burnt more calories and taken more steps than subject one. I further note that subject one did not make as much impact on weight loss as subject two - let's check:
# log initial weight of both users
initial_weight <- merge_cal_weight_filtered %>%
filter(Date == as.Date("2016-04-12")) %>%
pull(WeightKg)
# log final weight of both users
final_weight <- merge_cal_weight_filtered %>%
filter(Date == as.Date("2016-05-12")) %>%
pull(WeightKg)
# find percentage weight lost of both users
percentage_weight_loss <-
((initial_weight - final_weight) / initial_weight) * 100
print(percentage_weight_loss)
[1] 0.9599976 2.0979056
Subject one lost under 1% weight while subject two lost over 2% weight; however, we must be aware that we do not know the goals of each subject, nor do we know the calorie intake of each subject.
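The percentage-loss arithmetic, sketched with hypothetical start and end weights (not the subjects' actual values):

```r
# Hypothetical initial and final weights (kg) for two users
initial <- c(62.5, 85.0)
final   <- c(61.9, 83.2)

# Percentage of starting body weight lost over the period
pct_loss <- ((initial - final) / initial) * 100
print(round(pct_loss, 2))
```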
Add a calorie intake tracker and goal tracker to the application.
chart_step_sedentary <-
ggplot(data = merge_day) +
geom_point(mapping=aes(x = StepTotal, y = NonActiveHours)) +
geom_smooth(mapping=aes(x = StepTotal, y = NonActiveHours), method = 'lm') +
labs(title = "Non Active Hours vs Steps per Day",
x = "Total Daily Steps",
y = "Non Active Hours",
caption='12th April to 12th May 2016')
chart_step_sedentary
ggsave('chart_step_sedentary.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
This appears to show a minor negative correlation; let's check:
cor_step_sedentary <- cor(merge_day$StepTotal,
merge_day$NonActiveHours)
cat("Correlation Coefficient:", cor_step_sedentary, "\n")
Correlation Coefficient: -0.1341473
Correlation Coefficient: -0.1341473 - this indicates a weak negative correlation, so there is little to no linear relationship between the two variables.
This could be due to the lack of requirement to take a multitude of steps in order to be active - e.g. weightlifting.
Also to be borne in mind is that a user could be taking a vast amount of steps, but at an extremely leisurely pace and thus the Activity would fall into Sedentary or Lightly Active, which has been counted together as "Non Active" for the purposes of this analysis.
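A correlation coefficient alone does not tell us whether the relationship is statistically distinguishable from zero; `cor.test()` returns a p-value alongside the estimate. A minimal sketch on simulated data (standing in for `merge_day`, which is not reproduced here):

```r
# Sketch: testing whether a weak correlation differs from zero.
# Simulated vectors stand in for StepTotal and NonActiveHours; with the
# real data, pass merge_day$StepTotal and merge_day$NonActiveHours instead.
set.seed(42)
steps      <- rnorm(940, mean = 7500, sd = 2500)
non_active <- 20 - 0.0002 * steps + rnorm(940, sd = 3)  # built-in weak negative link

result <- cor.test(steps, non_active)
result$estimate  # correlation coefficient
result$p.value   # small p-value: correlation unlikely to be zero by chance
```

With 940 observations even a weak correlation can be highly significant, which is worth reporting alongside the coefficient itself.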
Let's look at Active Hours vs Total Daily Steps
chart_step_active <-
ggplot(data = merge_day) +
geom_point(mapping=aes(x = StepTotal, y = ActiveHours)) +
geom_smooth(mapping=aes(x = StepTotal, y = ActiveHours), method = 'lm') +
labs(title = "Active Hours vs Steps per Day",
x = "Total Daily Steps",
y = "Active Hours",
caption='12th April to 12th May 2016') +
annotate('text', x = 10000, y = 5,
         label = 'Those who have more Active Minutes generally complete more steps',
         color = 'navy')
chart_step_active
ggsave('chart_step_active.png',width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
This appears to show a positive correlation; let's check:
cor_step_active <- cor(merge_day$StepTotal,
merge_day$ActiveHours)
cat("Correlation Coefficient:", cor_step_active, "\n")
Correlation Coefficient: 0.7335519
Correlation Coefficient: 0.7335519 - this indicates a moderately strong positive correlation: higher values of one generally correspond to higher values of the other.
From this it would seem that being lightly active/sedentary has little impact on your daily step count, whereas being fairly/very active has a much greater impact, which makes sense.
I have touched on the hypothesis that activity does not necessarily depend on steps taken, and I note that there is no indication of which activity persons are doing when classed in the higher activity levels. Let's explore further and see if we can infer anything in regard to this.
Let's see how much distance is covered per day whilst being classed as "Very Active".
range(int_day$VeryActiveDistance)
summary_vact_dist <- int_day %>%
select(VeryActiveDistance) %>%
summary()
print(summary_vact_dist)
 VeryActiveDistance
 Min.   : 0.000
 1st Qu.: 0.000
 Median : 0.210
 Mean   : 1.503
 3rd Qu.: 2.053
 Max.   :21.920
The Very Active distance range is 0-21.92. The median is 0.210 while the maximum is 21.920, which is vastly different - let's investigate.
# Count total number of entries in the month
nrow(int_day)
# 940
# count number of entries above 0 for Very Active
int_day %>%
filter(VeryActiveDistance > 0) %>%
nrow()
# 527 - so 413 entries (44%) recorded no VeryActiveDistance within the month
# count number of entries above 0 for Moderately Active
int_day %>%
filter(ModeratelyActiveDistance > 0) %>%
nrow()
# 554 - so 386 entries (41%) recorded no ModeratelyActiveDistance within the
# month
# count number of entries above 0.210 (median)
int_day %>%
filter(VeryActiveDistance > 0.210) %>%
nrow()
# 469
# Count data above 10
int_day %>%
filter(VeryActiveDistance > 10) %>%
nrow()
# 24
# print the top 3 high distances
int_day %>%
top_n(3, VeryActiveDistance)
# 21.92, 21.66, 13.4
| Id | Date | SedentaryMinutes | LightlyActiveMinutes | FairlyActiveMinutes | VeryActiveMinutes | SedentaryActiveDistance | LightActiveDistance | ModeratelyActiveDistance | VeryActiveDistance |
|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <date> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1624580081 | 2016-05-01 | 1020 | 171 | 63 | 186 | 0.02 | 1.91 | 4.19 | 21.92 |
| 8053475328 | 2016-05-08 | 1073 | 228 | 14 | 125 | 0.00 | 3.66 | 0.59 | 13.40 |
| 8877689391 | 2016-04-30 | 1089 | 223 | 4 | 124 | 0.00 | 4.93 | 0.08 | 21.66 |
Can we infer the type of activity performed by using a threshold?
I will set a threshold on "VeryActiveDistance" to possibly differentiate between stationary and movement-based activities. I have set this as just under 1 (0.99) to allow for warm-up/cool-down and general movement around a gym, for example.
# Set threshold
distance_threshold <- 0.99
# Create a new column indicating the inferred activity type
int_day$InferredActivity <-
ifelse(int_day$VeryActiveDistance < distance_threshold, "Stationary",
"Movement")
# Count how many users are classed as participating in each inferred activity
int_day %>%
filter(VeryActiveDistance > 0) %>%
group_by(InferredActivity) %>%
summarise(count = n())
| InferredActivity | count |
|---|---|
| <chr> | <int> |
| Movement | 343 |
| Stationary | 184 |
We could consider sending a request/alert to a user whilst they are being classified as "Very" or "Fairly/Moderately" Active, requesting that they declare what type of activity they are doing at that time.
Let's visualize the above investigation onto a graph.
# Create a scatter plot
chart_vact_min_dist <-
ggplot(data = int_day %>%
filter(VeryActiveMinutes > 0),
aes(x = VeryActiveMinutes, y = VeryActiveDistance)) +
geom_point(aes(color = InferredActivity), size = 3) +
geom_smooth(color = 'skyblue', linewidth = 1, method = 'lm') +
labs(title = "Very Active Minutes vs Very Active Distance",
x = "Very Active Minutes",
y = "Very Active Distance",
color = "Inferred Activity") +
scale_color_manual(values = c("Stationary" = "red", "Movement" = "darkgreen")) +
theme_minimal()
chart_vact_min_dist
ggsave('chart_vact_min_dist.png', width=16, height=8)
# Let's check out a density plot with the data
chart_dens_vact_min_dist <-
ggplot(data = int_day %>%
filter(VeryActiveMinutes > 0),
aes(x = VeryActiveDistance)) +
geom_density(fill = "blue", alpha = 0.2) +
labs(title = "Density Plot of Very Active Distance",
x = "Very Active Distance",
y = "Density") +
theme_minimal()
chart_dens_vact_min_dist
ggsave('chart_dens_vact_min_dist.png', width=16, height=8)
`geom_smooth()` using formula = 'y ~ x'
The Scatterplot appears to show a moderately strong positive correlation between the time spent as "Very Active" per day and the Distance completed whilst being classified as "Very Active". However, there are a number of outliers.
With the density plot, we can easily see that the majority of data is between 0-5, but it also shows us the range of data as well; we can confirm this:
# Count data between distance of 0 and 5
int_day %>%
filter(VeryActiveDistance >= 0, VeryActiveDistance <= 5) %>%
nrow()
# 865 out of 940 entries fall between 0 and 5, as shown in the density plot.
Let's see how calories sit in this mix.
chart_vact_cal_dist <-
ggplot(data = merge_int_cal %>%
filter(VeryActiveMinutes > 0)) +
geom_point(mapping=aes(x = VeryActiveDistance, y = Calories)) +
geom_smooth(mapping=aes(x = VeryActiveDistance, y = Calories), method='lm') +
labs(title = "Very Active Distance vs Calories",
     x = "Very Active Distance",
     y = "Calories") +
theme_minimal()
chart_vact_cal_dist
ggsave('chart_vact_cal_dist.png', width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
There does appear to be a correlation between calories burnt and the distance traveled - let's check:
cor_vact_cal_dist <- cor(merge_int_cal$VeryActiveDistance,
merge_int_cal$Calories)
cat("Correlation Coefficient:", cor_vact_cal_dist, "\n")
Correlation Coefficient: 0.4919586
Correlation Coefficient: 0.4919586 - this indicates a moderate positive correlation: as one value increases, so (moderately) does the other.
Let's plot calories against Very Active Minutes.
chart_vact_cal_min <-
ggplot(data = merge_int_cal %>%
filter(VeryActiveMinutes > 0)) +
geom_point(mapping=aes(x = VeryActiveMinutes, y = Calories)) +
geom_smooth(mapping=aes(x = VeryActiveMinutes, y = Calories), method='lm') +
labs(title = "Very Active Minutes vs Calories",
     x = "Very Active Minutes",
     y = "Calories") +
theme_minimal()
chart_vact_cal_min
ggsave('chart_vact_cal_min.png', width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
There does appear to be a correlation between calories burnt and minutes spent as Very Active - let's check:
cor_vact_cal_min <- cor(merge_int_cal$VeryActiveMinutes,
merge_int_cal$Calories)
cat("Correlation Coefficient:", cor_vact_cal_min, "\n")
Correlation Coefficient: 0.6158383
Correlation Coefficient: 0.6158383 - this indicates a stronger positive correlation: as one value increases, so does the other. From this we can hypothesize that increasing time spent as "Very Active" matters more than increasing distance traveled - e.g. high-intensity weight lifting may burn more calories than long-distance running or cycling.
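One way to probe this hypothesis further would be to put minutes and distance into a single linear model and compare their fitted slopes. A sketch on simulated data, since it is not run against `merge_int_cal` here:

```r
# Sketch: comparing the contributions of Very Active minutes and distance
# to calories in one model. The data is simulated, not from merge_int_cal.
set.seed(1)
n        <- 500
minutes  <- rpois(n, 30)
distance <- pmax(0, minutes * 0.05 + rnorm(n, sd = 1))   # distance loosely tracks minutes
calories <- 1800 + 25 * minutes + 10 * distance + rnorm(n, sd = 300)

fit <- lm(calories ~ minutes + distance)
summary(fit)$coefficients  # compare the two slopes and their p-values
```

On the real data, the relative size and significance of the two slopes would indicate which predictor carries more weight once the other is held constant.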
Provide information to customers regarding this finding.
Provide weight lifting routines for customers.
I note outliers such as one showing VeryActiveMinutes below 25 but almost 5,000 calories burnt - perhaps they are conducting less intensive exercise in higher quantities, i.e. Fairly Active Minutes rather than Very Active Minutes. Let's take a quick look at that:
chart_mact_cal_min <-
ggplot(data = merge_int_cal %>%
filter(FairlyActiveMinutes > 0)) +
geom_point(mapping=aes(x = FairlyActiveMinutes, y = Calories)) +
geom_smooth(mapping=aes(x = FairlyActiveMinutes, y = Calories), method='lm') +
labs(title = "Fairly Active Minutes vs Calories",
     x = "Fairly Active Minutes",
     y = "Calories") +
theme_minimal()
chart_mact_cal_min
ggsave('chart_mact_cal_min.png', width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
There is a noticeable difference in the correlation here - let's check:
cor_mact_cal_min <- cor(merge_int_cal$FairlyActiveMinutes,
merge_int_cal$Calories)
cat("Correlation Coefficient:", cor_mact_cal_min, "\n")
Correlation Coefficient: 0.2976235
Correlation Coefficient: 0.2976235 - this indicates a weak positive correlation: as one value increases, the other also increases, but only weakly. From this we can tell that higher-intensity exercise is much more closely related to calorie burn.
However, I note that same outlier as before near the 5,000 calorie mark but very low in Very and Fairly Active Minutes, let's check Lightly Active Minutes.
chart_lact_cal_min <-
ggplot(data = merge_int_cal %>%
filter(LightlyActiveMinutes > 0)) +
geom_point(mapping=aes(x = LightlyActiveMinutes, y = Calories)) +
geom_smooth(mapping=aes(x = LightlyActiveMinutes, y = Calories), method='lm') +
labs(title = "Lightly Active Minutes vs Calories",
     x = "Lightly Active Minutes",
     y = "Calories") +
theme_minimal()
chart_lact_cal_min
ggsave('chart_lact_cal_min.png', width=16,height=8)
`geom_smooth()` using formula = 'y ~ x'
The correlation here is similar to that of Fairly Active, let's check:
cor_lact_cal_min <- cor(merge_int_cal$LightlyActiveMinutes,
merge_int_cal$Calories)
cat("Correlation Coefficient:", cor_lact_cal_min, "\n")
Correlation Coefficient: 0.2867175
Correlation Coefficient: 0.2867175 - this indicates a weak positive correlation, similar to Fairly Active Minutes, reinforcing that higher-intensity exercise is much more closely related to calorie burn.
We can see that this outlier near the 5,000 calorie mark is around 300 Lightly Active Minutes, let's examine them and see where they are burning their high level of calories versus a low calorie burning outlier:
# Filter data for the five highest calorie burners
outlier_high <- merge_int_cal %>%
top_n(5, wt = Calories) %>%
mutate(Group = "High Calorie Burn")
# Filter data for the five lowest calorie burners
outlier_low <- merge_int_cal %>%
top_n(-5, wt = Calories) %>%
mutate(Group = "Low Calorie Burn")
# Combine the dataframes into one called "outliers"
outliers <- bind_rows(outlier_high, outlier_low) %>%
select(Id, Date, Group, everything())
# Let's plot it
chart_cal_sed <-
ggplot(outliers, aes(x = Group, y = SedentaryMinutes, fill = Group)) +
geom_violin() +
labs(title = "Comparison of Sedentary Minutes between High and Low
Calorie Burners",
x = "Group",
y = "Sedentary Minutes",
fill = "Calorie Burn Group") + # Add the fill scale label
scale_fill_manual(values = c("High Calorie Burn" = "darkgreen",
"Low Calorie Burn" = "red")) +
theme_minimal()
chart_cal_sed
ggsave('chart_cal_sed.png', width=16,height=8)
A significant portion of the low calorie burning group shows similar levels of sedentary behaviour, reaching almost 1,500 minutes; this points to a consistent lifestyle/routine.
The high calorie burn group shows a wider distribution of sedentary minutes: the majority sit between 500-650, with a smaller proportion reaching up to 900.
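The visual impression from the violin plot can be backed up with per-group summary statistics; a sketch using toy values in place of the `outliers` dataframe built above:

```r
library(dplyr)

# Sketch: summarising sedentary minutes per calorie-burn group.
# Toy values loosely mimic the shapes seen in the violin plot;
# with the real data, pipe `outliers` in instead of `toy`.
toy <- tibble(
  Group = rep(c("High Calorie Burn", "Low Calorie Burn"), each = 5),
  SedentaryMinutes = c(520, 600, 640, 880, 900,
                       1380, 1400, 1420, 1450, 1470)
)

toy %>%
  group_by(Group) %>%
  summarise(median_sedentary = median(SedentaryMinutes),
            iqr_sedentary    = IQR(SedentaryMinutes))
```

A wider IQR for the high-burn group would confirm the broader spread the violin plot suggests.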
The data presents a number of difficulties due to the low number of participants and the suspicion that not all features/applications were utilized, meaning not all data was available.
Further research should be conducted regarding the logging of hydration throughout the day, as we currently have no information on it.
There are other potentially marketable avenues to explore for the Bellabeat range, such as teaming up with "Calm" or a similar service to encourage better sleep patterns.
Other recommendations include adding smart scales to the Bellabeat range, encouraging a more positive relationship with the logging of weight, and providing more informative and engaging content to help our customers achieve their goals and gain strength and confidence from being able to achieve them independently.